Test-retest Repeatability and Interobserver Variation of Healthy Tissue Metabolism using FDG-PET/CT of the Thorax among Lung Cancer Patients

Objectives: The aim of this study was to assess the test-retest repeatability and interobserver variation in healthy tissue (HT) metabolism using 2-deoxy-2-[18F]fluoro-D-glucose (18F-FDG) positron emission tomography/computed tomography (PET/CT) of the thorax in lung cancer patients.
Methods: A retrospective analysis was conducted in 22 patients with non-small cell lung cancer who had two PET/CT scans of the thorax performed three days apart with no interval treatment. The maximum, mean and peak standardized uptake values (SUV) in different HTs were measured by a single observer for the test-retest analysis and two observers for interobserver variation. Bland-Altman plots were used to assess the repeatability and interobserver variation. Intrasubject variability was evaluated using within-subject coefficients of variation (wCV).
Results: The wCV of test-retest SUVmean measurements in mediastinal blood pool, bone marrow, skeletal muscles and lungs were <20%. The left ventricle (LV) showed higher wCV (>60%) in all SUV parameters with wide limits of repeatability. High interobserver agreement was found with wCV of <10% in SUVmean of all HT, but up to 22% was noted in the LV.
Conclusion: HT metabolism is stable in a test-retest scenario and has high interobserver agreement. SUVmean was the most stable metric in organs with low FDG-uptake and SUVpeak in HTs with moderate uptake. Test-retest measurements in LV were highly variable irrespective of the SUV parameters used for measurements.


Introduction
The uptake of 2-deoxy-2-[18F]fluoro-D-glucose (18F-FDG) reflects glucose metabolism, not only in inflammatory and malignant tissues, but also in healthy tissues (HT) such as lung, myocardium, liver, spleen and bone marrow. Several studies have reported that 18F-FDG uptake in HT on positron emission tomography/computed tomography (PET/CT), either before or during treatment, is linked to different factors including potential treatment-related adverse events [1][2][3][4][5]. Monitoring cancer response to therapy as well as the effects of therapy on HTs has become even more relevant with the increasing use of immunotherapy. Immunotherapy, such as the immune checkpoint inhibitors used to treat various solid tumours and some haematological malignancies, can activate the immune system, causing unique patterns of 18F-FDG uptake in tumour as well as HT that reflect the induced inflammatory response [6,7]. These patterns, in conjunction with changes in tumour metabolism, may predict response to treatment or indicate treatment-related adverse events. Thus, expanding the utilization of 18F-FDG PET/CT to evaluate not only tumour but also healthy tissue metabolism during treatment may have potential to predict side effects and improve management of oncology patients.
As 18F-FDG distribution in the body is non-specific [8], it is important to distinguish variations due to physiological changes or measurement error from abnormal or true changes in tissue metabolism when assessing cancer patients undergoing different treatment modalities. Few studies in the literature have evaluated variations in HT metabolism [9,10], and most have focused on specific organs such as the liver as a reference organ [11,12] or on interventional treatments performed between scans [13][14][15]. Knowledge of variations in HT metabolism measured by 18F-FDG without interval treatment in a test-retest setting is still very limited. Furthermore, consistency in the interpretation and measurement of 18F-FDG uptake in HTs between reporters is important, especially as no standardised method for measurement has yet been agreed upon [9,10,14].
The first aim of this study was to evaluate the test-retest repeatability of 18F-FDG as a surrogate for HT metabolism in the thoracic area using different standardized uptake value (SUV) parameters commonly applied in PET/CT imaging for cancer patients who received no intervening treatment. The second aim was to assess the interobserver variation in HT metabolism using 18F-FDG PET/CT and suggest suitable methods for measurement of HT metabolism in future studies.
Rigshospitalet in Copenhagen [16]. For the analysis, areas of interest that were outside the scanning field of view, or that showed artefacts or disease involvement, were excluded.
All patients gave their informed consent in writing for the scientific use of their data. Study approval was obtained from the Danish Ethics Committee (protocol number H-1-2014-011) and the Danish Data Protection Agency (02986/30-1271) [16]. Study methods were performed in accordance with the relevant guidelines and regulations.

Image Acquisition
All patients were instructed to fast for at least 4 hours prior to examination. Patients were administered 4 MBq/kg of 18F-FDG and positioned in the radiotherapy treatment position for scanning, with both arms placed over the head. Two PET/CT scans were performed 2 to 5 days apart with no interval treatment. The aim was to acquire both scans at the same time of day. On both days the patients had a thoracic PET scan with low-dose CT on the same PET/CT scanner (Siemens Biograph mCT, Siemens Healthineers, Erlangen) following the guidelines of the European Association of Nuclear Medicine (EANM) [16][17].
For PET acquisition, 2-3 minutes per bed position was applied according to the patient's body mass index (BMI). PET images were reconstructed iteratively with a 3D ordered-subset expectation-maximisation technique incorporating point spread function and time of flight, with correction for attenuation and scatter. PET images were reconstructed with pixel sizes of 2 x 2 mm and slice thickness of 2 mm. Low-dose CT scans were acquired in 3 to 4 seconds using 120 kV and 40 mAs and were subsequently used for attenuation correction of PET images. Detailed information about image acquisition and procedure has been described previously by Nygård et al. [16].

Image Analysis
For assessment of test-retest repeatability using the acquired free-breathing PET/CT scans, the following healthy tissue regions were evaluated: mediastinal blood pool (MBP), left ventricle (LV), bone marrow (BM), skeletal muscle (SM), and lungs divided into right upper zone (RUZ), right middle zone (RMZ), right lower zone (RLZ), left upper zone (LUZ), left middle zone (LMZ) and left lower zone (LLZ). A Mirada XD® workstation version 1.1.0.3.1 (Mirada Medical, Oxford, UK) was used to measure the maximum SUV (SUVmax), mean SUV (SUVmean) and peak SUV (SUVpeak). SUVs were measured using a 1.5 cm sphere as a volume of interest (VOI) or a manually contoured region of interest (ROI). The evaluation of each organ is described in more detail in Table 1. Furthermore, as patients included in this study did not adhere to the strict diet recommended for cardiac studies [18], measurement of myocardial uptake was supplemented with a simple visual score (0 = no uptake, less than or equal to MBP; 1 = patchy uptake above MBP; 2 = diffusely increased LV uptake above MBP). All PET/CT scans were evaluated by a single observer (a nuclear medicine technologist) for the test-retest repeatability and by two observers for the analysis of interobserver variation, i.e. a nuclear medicine physician (observer 1) with over 10 years' experience and the nuclear medicine technologist (observer 2). Both observers analysed all 22 scans, the order of which was randomly selected (scan 1 or scan 2) for each patient. The visual scoring for myocardium was performed by the nuclear medicine physician.

Statistical Analysis
Descriptive statistics including mean and standard deviation (SD) for all SUV parameters (SUVmax, SUVmean and SUVpeak) of each organ on scan 1 and scan 2 were calculated. Scatter plots were created to illustrate the distribution of the SUV parameters in each organ for the test-retest scans. To account for multiple comparisons, the standard 5% significance level was corrected to P <0.0015 using the Bonferroni method with 33 tested parameters.
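The corrected threshold follows from dividing the nominal significance level by the number of tested parameters. The short Python sketch below reproduces that arithmetic; the function name is illustrative only and is not part of the original analysis, which was carried out in SPSS.

```python
def bonferroni_threshold(alpha, n_tests):
    """Return the per-test significance level after Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# 5% nominal level shared across the 33 tested parameters
threshold = bonferroni_threshold(0.05, 33)
print(f"Corrected significance level: {threshold:.4f}")
```

Rounded to four decimal places, this gives the 0.0015 threshold used throughout the Results.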
Repeatability was defined as the difference between scan 1 and scan 2 in individual patients, with the mean ± SD of the differences calculated for each SUV parameter. Further repeatability analysis requires that the differences between the paired observations follow a normal distribution, which was assessed with the Shapiro-Wilk test [19]. Natural log-transformation was used for the subsequent analysis as SUV measurements tend to be log-normally distributed [19][20]. A paired t-test was used to investigate any significant bias in the differences or log differences. The difference in log-transformed data $d_{\ln}$ was assessed as follows:
$$d_{\ln} = \ln(\mathrm{SUV}_2) - \ln(\mathrm{SUV}_1) \qquad (1)$$
where $\mathrm{SUV}_1$ and $\mathrm{SUV}_2$ denote SUV from the same organ in scan 1 and scan 2, respectively. To assess the repeatability of a single measurement, the within-subject standard deviation ($\mathrm{wSD}_{\ln}$) was obtained using the SD of the log-transformed difference, and exponentiation was then applied to calculate the within-subject coefficient of variation (wCV%) as a percentage as follows [21]:
$$\mathrm{wSD}_{\ln} = \frac{\mathrm{SD}(d_{\ln})}{\sqrt{2}} \qquad (2)$$
$$\mathrm{wCV\%} = \left(\exp(\mathrm{wSD}_{\ln}) - 1\right) \times 100 \qquad (3)$$
The 95% repeatability coefficient ($\mathrm{RC}_{\ln}$) was calculated on the log-transformed data and exponentiation was applied to determine its upper and lower 95% RC in percentage as
$$\mathrm{RC}_{\ln} = \pm 1.96 \times \mathrm{SD}(d_{\ln}) \qquad (4)$$
$$\mathrm{RC} = \left(\exp(\pm \mathrm{RC}_{\ln}) - 1\right) \times 100 \qquad (5)$$
The 95% confidence intervals (CIs) of the upper and lower RCs were also calculated using the χ2 distribution [21].
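The calculations in equations (2) to (5) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the SPSS workflow used in the study: the function names are hypothetical, the log difference is taken as scan 2 minus scan 1 (the sign does not affect wCV or RC), and the within-subject SD is obtained by dividing the SD of the log differences by √2, the usual conversion for paired measurements, which is consistent with the LV values reported in the Results.

```python
import math
import statistics


def metrics_from_sd(sd_d_ln):
    """Convert the SD of the log differences into wSD, wCV% and 95% RC%."""
    wsd_ln = sd_d_ln / math.sqrt(2)          # within-subject SD on the log scale
    wcv_pct = (math.exp(wsd_ln) - 1) * 100   # within-subject coefficient of variation
    rc_ln = 1.96 * sd_d_ln                   # 95% repeatability coefficient, log scale
    return {
        "wsd_ln": wsd_ln,
        "wcv_pct": wcv_pct,
        "rc_lower_pct": (math.exp(-rc_ln) - 1) * 100,
        "rc_upper_pct": (math.exp(rc_ln) - 1) * 100,
    }


def repeatability_metrics(suv_scan1, suv_scan2):
    """Compute repeatability metrics from paired SUVs of the same organ."""
    if len(suv_scan1) != len(suv_scan2) or len(suv_scan1) < 2:
        raise ValueError("need at least two paired measurements")
    # Assumed sign convention: scan 2 minus scan 1 on the natural-log scale
    d_ln = [math.log(s2) - math.log(s1) for s1, s2 in zip(suv_scan1, suv_scan2)]
    return metrics_from_sd(statistics.stdev(d_ln))


# Hypothetical SUVmean values for one tissue in four patients (not study data)
scan1 = [1.6, 1.8, 1.5, 2.0]
scan2 = [1.7, 1.7, 1.6, 1.9]
for name, value in repeatability_metrics(scan1, scan2).items():
    print(f"{name}: {value:.2f}")
```

Because the limits are exponentiated from a symmetric interval on the log scale, the upper and lower RC are asymmetric in percentage terms: an upper limit of +328.3% corresponds to a lower limit of about -76.7% and a wCV of about 69%.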
Bland-Altman plots were computed for the log-transformed differences against their means with their 95% CIs and the 95% upper and lower limits of RC. Linear regression analysis was used to assess the effect of the mean on the difference, which may indicate proportional bias. Trends of differences against the mean were investigated using Pearson's correlation coefficient on both original and log-transformed data. Trends in the variance of the differences with the mean were additionally assessed using Kendall's tau to correlate absolute differences against the mean in the original and log-transformed data. Similar statistical methods were applied for the interobserver analyses. Student's paired t-test was used to compare the averages in weight, administered activity and uptake time between the two scans. Statistical analyses were performed using IBM SPSS Statistics for Windows, version 26 (IBM Corp., Armonk, N.Y., USA).
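As a rough illustration of the quantities behind a Bland-Altman plot on the log scale, the Python sketch below computes the per-patient means and differences, the mean difference (bias), the SD of the differences, and the least-squares slope of the difference on the mean as a simple check for proportional bias. The function name and returned keys are hypothetical, the sign convention (second measurement minus first) is an assumption, and the correlation tests and confidence intervals described above are not reproduced here.

```python
import math
import statistics


def bland_altman_log(suv_a, suv_b):
    """Summarise log-scale agreement between two paired sets of SUVs."""
    if len(suv_a) != len(suv_b) or len(suv_a) < 2:
        raise ValueError("need at least two paired measurements")
    ln_a = [math.log(v) for v in suv_a]
    ln_b = [math.log(v) for v in suv_b]
    diffs = [b - a for a, b in zip(ln_a, ln_b)]        # y-axis of the plot
    means = [(a + b) / 2 for a, b in zip(ln_a, ln_b)]  # x-axis of the plot
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    # Ordinary least-squares regression of the difference on the mean
    mean_x = statistics.mean(means)
    sxx = sum((x - mean_x) ** 2 for x in means)
    if sxx == 0:
        raise ValueError("all pair means are identical; slope is undefined")
    sxy = sum((x - mean_x) * (d - bias) for x, d in zip(means, diffs))
    slope = sxy / sxx
    return {
        "bias_ln": bias,
        "sd_ln": sd,
        "slope": slope,
        "intercept": bias - slope * mean_x,
    }


# Hypothetical paired measurements (not study data)
result = bland_altman_log([1.6, 1.8, 1.5, 2.0], [1.7, 1.7, 1.6, 1.9])
print(result)
```

A slope close to zero suggests that the difference does not depend on the magnitude of the measurement, whereas a clearly non-zero slope would point towards proportional bias.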

Results
There were 7 female and 15 male patients with histologically confirmed NSCLC, all with BMI ≤ 30, and no patients had type I diabetes. Patient characteristics are shown in Table 2 [16]. In four patients, SUV measurements of the LV were excluded because part of the myocardium was outside the scanning field. SUVs of the lung parenchyma were not measured due to disease involvement in the LUZ, LMZ, LLZ and RUZ in two, two, one and four patients, respectively.

Test-retest Repeatability
We found no significant changes in the average weight, administered activity or uptake time after tracer administration between the two scans (P = 0.162, P = 0.332 and P = 0.719, respectively).
The paired t-test showed no significant bias in the differences (all P >0.0015). Although the t-test can be robust, some violations of the assumption of normality were found; thus, the nonparametric Wilcoxon signed-rank test was also applied, but this did not change the results, as shown in Table 3 (see also Table S1, supplemental digital content [SDC], which illustrates further repeatability analyses).
The mean differences for SUV parameters in all HTs between the two scans were small, ranging from -0.13 to +0.11, except in the LV, where the mean differences between scans varied from +1.33 to +1.47 (Table 3). MBP measurements had high repeatability between the two scans, with the lowest intra-subject variation, which remained within ~10%, and 95% RCs within -23.9% and +31.4% for all SUV measurements (Table 3 and Fig. 1). Less stability in the intra-subject measurements was observed in the BM, SM and the lungs, where wCV ranged from 9.9% to 31.3% and corresponded to wider limits of agreement for all SUV parameters, as shown in Table 3 (Fig. S1-S10, SDC, which include Bland-Altman and scatter plots for test-retest of the different tissues). The highest intra-subject variation was associated with measurements in the LV, with wCVs varying from 63.6% to 69.0% and wide 95% limits of repeatability from -76.7% to +328.3% (Table 3 and Fig. 1). Based on visual scoring, myocardial uptake similar to MBP (score 1) was seen in 10 patients on scan 1 and 5 on scan 2, uptake higher than MBP with a patchy pattern (score 2) in 1 patient on both scan 1 and scan 2, and diffuse high uptake (score 3) in 7 patients on scan 1 and 12 on scan 2. All patients with changes in the uptake relative to MBP between the two scans had diffusely increased uptake in the second scan (5 patients from score 1 to 3).
SUVmean measurements based on wCVs were more stable between the two scans compared to SUVmax and SUVpeak in the lungs. SUVpeak was the most stable in MBP and BM, with similar stability to SUVmean observed in the SM, but all SUV parameters were highly variable in the LV (Table 3).
Pearson's correlation for the difference against the mean showed a strictly significant trend only in SUVmean of the BM, with a moderate negative correlation in both original and log-transformed data (r = -0.64, P = 0.001 and r = -0.66, P <0.001, respectively) (Table S2 and Fig. S1b, SDC, for SUVmean of BM). Because Pearson's correlation can be sensitive to violations of the assumption of normality, the nonparametric Kendall's tau was also applied for completeness, confirming a similar trend in BM SUVmean. The variance of the difference relative to the mean assessed by Kendall's tau showed only one measure with a strictly significant positive correlation (tau = 0.59, P <0.001), in SUVmax of lung RMZ in the untransformed data (Table S2, SDC). The distribution plots for SUV measurements of the BM and RSM indicated possible bias (Fig. S10a-S10b, SDC).

Interobserver Variation
No significant bias (P >0.0015) was found in the paired comparisons of the differences in interobserver measurements for all HTs. The mean differences for SUV parameters in all HTs between the two observers ranged from -0.28 to +0.15 (Table 4). SUVmean and SUVpeak measurements both showed high interobserver agreement for MBP, BM and SM, with wCV of ≤10.3% and narrow limits of repeatability in each tissue, as shown in Table 4, for BM (Fig. 2) and for other HTs (Fig. S11-S20, SDC). The wCVs of LV were up to 21.6%, indicating less agreement between observers, with wider ranges of upper and lower RCs (Table 4). SUVmean of interobserver measurements showed high agreement in all lung zones with wCV of ≤10.5%, but SUVmax and SUVpeak measurements showed more variation, specifically in lung RUZ, with wCVs of 33.1% and 36.5%, respectively, as presented in Table 4 (Fig. 2).
Pearson's correlation for interobserver measurements was not significant (P >0.0015) for any SUV parameter. Kendall's tau showed one strictly significant correlation (tau = -0.50, P <0.001) in SUVmax of lung RLZ in the log-transformed data (Table S5, SDC). The Bland-Altman plots of SUVmax and SUVpeak in MBP suggested possible bias (Fig. S11a and S11c, SDC).

Discussion
The aims of this study were to assess the test-retest repeatability and interobserver variation of HT metabolism using the SUV parameters commonly applied in 18F-FDG PET/CT imaging for cancer patients. We found no significant bias in the mean differences of either the test-retest or the interobserver measurements in different HTs.
Currently only metabolic activity of the liver and MBP is routinely used for response evaluation in patients with lymphoma [22]. Metabolic activity in the liver has been assessed by several studies [10][11][12][14][23] Deauville scale for lymphoma [22]. However, Kramer et al. reported that normalisation of tumour uptake to MBP was more variable at 90 min than 60 min after 18F-FDG injection, suggesting that MBP repeatability might be influenced by the uptake time [23]. Similar good repeatability was seen in the bone marrow in our study, with one outlier which may explain the slightly wider variations in the repeatability than other HTs as indicated by the wCVs.
Paquet et al. reported significant variations in the average difference of SUVmax and SUVmean of SM, which was not found in our analysis [10]. But similar to Paquet et al. we observed an average decrease, although not significant, in 18 (Table 5) [10]. One of the suggested causes is the low 18F-FDG uptake in the lungs, leading to high susceptibility to noise when measuring SUV, and the close proximity to the liver [10]. In contrast to the findings of Paquet et al., no significant variations were detected in the average differences in different lung zones in the current study. The better repeatability in our study might be due to the use of larger ROIs placed at least 2 cm away from the liver, a shorter period between the two scans (3.1 ± 1 vs. 271 ± 118 days) and a more rigorous repeatability study design, whereas the analysis by Paquet et al. was based on retrospective follow-up of oncology patients [10].
In the repeatability study by Gheysens et al., a wCV of 20.7% was reported in the LV, which was much lower than the wCVs we obtained for the different SUV parameters, ranging from 63.6% to 69.0% [9]. The study by Gheysens et al. was conducted in 6 healthy individuals with a mean age ± SD of 32 ± 10 years [9] rather than in cancer patients with a mean age of 68.6 ± 7.7 years as in our analysis. This may imply that the age and physical condition of patients affect the stability of 18F-FDG uptake in the LV, but also that variability in myocardial uptake may be more common in cancer patients [15]. The findings of Thut et al. in 20 patients with non-Hodgkin's lymphoma also support this hypothesis: they showed high regional variability in the LV, with variable patterns of SUVmax across different regions of the LV in several PET/CT scans regardless of the fasting period [15]. Similar observations were reported in a retrospective study by Inglese et al. in 49 patients with various malignancies during treatment, which showed heterogeneity of uptake in different myocardial regions and high variability in the same region at different time points [14]. On the other hand, Kim et al. found no significant variations in volumetric measurements of myocardial 18F-FDG uptake in patients with diffuse large B-cell lymphoma during treatment [25]. It may be preferable to use a more global assessment in addition to a regional evaluation when monitoring changes in 18F-FDG metabolism in the myocardium during treatment of cancer patients.
However, the simple visual assessment performed in our study indicates that the low repeatability in LV measurements is unlikely to be primarily a result of the segmentation method applied. Instead, it simply reflects the large inter- and intraindividual variations in myocardial uptake when patients have not been instructed to follow a low-carbohydrate diet and prolonged fasting [18]. This questions the use of myocardial uptake as a prognostic marker [1] unless dietary conditions are carefully controlled.
Image interpretation of HT metabolism may be inconsistent between observers when nonstandard methods are used for measuring 18F-FDG uptake in HT. In our interobserver analysis using standardised placement of fixed VOIs, we found low wCV indicating high agreement in MBP, BM and SM and moderate agreement in the LV for all SUV measurements. High agreement was also noted in SUVmean measurements between the observers in all lung zones, but there was more variability in SUVmax and SUVpeak measurements. To the best of our knowledge, interobserver agreement of 18F-FDG uptake in HT has only been studied in the liver and brown adipose tissue [12][26][27].
Other indirect interobserver analyses have been conducted. Burger et al. compared two methods for measuring the background activity from different healthy tissues as reference regions for malignant lesions [28]. The study showed excellent interobserver agreement in the VOI method used for mean background activity of the respective organs including the lung, liver, skin and neck [28]. Despite the high agreement in SUVmean in our interobserver measurements in the lungs, the variations in SUVmax and SUVpeak might be attributed to the observer-dependent ROI sizes, which may augment the effect of possible spillover from adjacent lung cancer lesions or physiological high uptake, e.g. in the myocardium. Ohira et al. assessed interobserver variation of myocardial metabolism in 2 groups of patients with cardiac sarcoidosis, on a low-carbohydrate/high-fat diet and an unrestricted diet respectively, using pattern and regional ROI approaches [29]. Agreement in the pattern interpretation was moderate, but results showed a trend for improved agreement in the restricted diet group [29]. This may be a possible factor for the only moderate interobserver agreement in the LV among patients in the current study, who were not on cardiac-specific dietary restriction.
One of the interesting findings with regard to the methods of measurement is that we found a large disparity in the test-retest and interobserver lung measurements of SUVmax and SUVpeak compared to SUVmean. Schwartz et al. pointed out in their phantom repeatability study that SUVmax and SUVmean are similar when measured in smaller ROIs, more homogeneous sources and at longer scan times (>3 min/bed position), but the repeatability improves with larger objects and SUVmean has better repeatability regardless of the ROI size [30]. Because lung tissue is more susceptible to statistical errors, using larger ROIs and SUVmean for evaluating the variation in 18F-FDG uptake in the lung is likely to be more reliable than smaller ROIs and values derived from SUVmax or SUVpeak.
With this study we also wished to formulate recommendations for future studies evaluating HT metabolism. Based on the presented repeatability analysis of HT metabolism, we found both SUVpeak and SUVmean were more stable in HT than SUVmax. The overall test-retest repeatability and interobserver variation were better in HTs with moderate 18F-FDG uptake (MBP and BM) and lower in tissues with low uptake (SM and lungs). These observations in HTs agree with the pattern of repeatability observed in measurement of tumours, whereby the repeatability is improved as 18F-FDG uptake increases [19]. Our findings also indicate that in organs with very low physiological 18F-FDG uptake, measurements using SUVmean in a larger ROI or segmented organ might be preferable. This, however, raises an issue regarding interobserver variation in defining these regions, which may be improved by applying automated segmentation, e.g. based on artificial intelligence.
There are some limitations to our study. The retrospective analysis prevented the control of potential factors such as the restricted diet for LV assessment; nevertheless, analysis of other HTs under a restricted diet would possibly not reflect the normal status of cancer patients having a clinical PET/CT scan. The assessment of the test-retest scans was performed by a less experienced observer; however, the subsequent interobserver analysis against an experienced reader showed good agreement and low interobserver variation. In addition, other HTs such as liver, spleen and bowel were not analysed because the original study required thoracic PET/CT scans only, to assess the repeatability of lung cancer lesions using different breathing protocols. The test-retest and interobserver variation of liver SUV has, however, been previously reported [11][12][26][27]. As we only used data from lung cancer patients, our results might not be applicable to patients with other types of malignancies. However, we consider that the repeatability analysis and results are likely to apply across a broad range of cancers, especially as all scans were acquired prior to any treatment. Furthermore, it might be difficult to estimate correlations between the differences and the means in the test-retest scans and interobserver measurements because of random noise from the low range of SUV in HT, combined with the large number of statistical tests and small sample size. It would be desirable to validate our results in a larger independent sample, but conducting repeatability studies on large numbers of patients with repeat radiation exposure, particularly for evaluation of HTs, might not be feasible or ethical.

Conclusion
HT metabolism is stable in a test-retest scenario and has high interobserver agreement. The wCV of SUVmean measurements was <20% between the two scans and <10% between the observers; thus, variation in SUVmean of over 20% would indicate a true change. SUVmean is suggested as the most stable metric, especially in organs with low 18F-FDG uptake (SM and lungs). For HTs with moderate uptake (MBP and BM), SUVpeak is suggested as the preferred metric. Test-retest measurements in the LV were highly variable, irrespective of the SUV parameter used, although this might be reduced by considering automated segmentation and assessment methods that do not rely solely on regional analysis, accompanied by dietary restrictions where feasible.

Supplementary Material
Refer to Web version on PubMed Central for supplementary material.
Fig. 2. Interobserver variation of log-transformed SUV measurements including maximum, mean and peak in (a-c) bone marrow (BM) and (e-g) lung right upper zone (RUZ) illustrated by the Bland-Altman method. A simple linear regression indicated no significant bias in (a-c) BM and (e-g) lung RUZ data (P >0.0015). Scatter plots for distribution of interobserver measurements for different SUV parameters in (d) BM and (h) lung RUZ. Mean, mean of SUV difference between measurements of observer 1 and 2; URC/LRC, upper and lower repeatability coefficients; UCI/LCI, upper and lower 95% confidence intervals of the mean; SUV, standardized uptake value.
Table 1. Assessment methods to evaluate variability of FDG uptake in healthy tissue

Tissue: Mediastinal blood pool
Location: Descending aorta
Method: Placing of 1.5 cm diameter sphere using nudge tool on descending aorta [30-31] to ensure the volume of interest (VOI) is not touching the wall of the aorta (guided by CT images to avoid artefacts from adjacent structures or atherosclerosis-associated inflammation).

Tissue: Myocardium
Location: Lateral wall of the left ventricle (LV)
Method: Placing of 1.5 cm sphere using nudge tool at the highest uptake area in the lateral wall of the LV to stay within the wall boundaries in a mid-trans-axial PET/CT slice excluding artefact [30-31].

Tissue: Bone marrow
Location: Thoracic vertebral body at level of bifurcation of the carina
Method: Placing of 1.5 cm diameter sphere at mid-vertebral body in a trans-axial PET/CT slice with review of sagittal CT images to confirm accurate placement and avoid areas of focal uptake, artefacts, compression fracture or severe osteoarthritic changes.

Tissue: Skeletal muscle
Location: Right and left teres major muscles
Method: Placing of 1.5 cm diameter sphere using nudge tool in teres major at each selected skeletal muscle, excluding areas of focal uptake, in a trans-axial PET/CT slice.

Tissue: Lungs
Location: Both lung zones
Method: Manual drawing of region of interest segmenting all of lung parenchyma, leaving a margin to avoid overlap with pleura, at a single slice in respective upper, middle and lower zones of the lungs [32], excluding the hilar vessels and any disease in a trans-axial PET/CT slice. ROI in the RLZ of the lung was placed at least 2 cm away from the liver. ROIs were used for lungs to avoid the inclusion of tumours or any possible areas of inflammation, and large ROIs were drawn to reduce the possible noise effect [31] and to get better insight into SUV repeatability and variations in each zone.